{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [ Chapter 10 - Learning to Rank for Generalizable Search Relevance ]\n",
    "# Setup TheMovieDB Collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from aips import get_engine, get_ltr_engine\n",
    "engine = get_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Collection\n",
    "\n",
    "Create collection for http://themoviedb.org (TMDB) dataset for this book. We will just look at title, overview, and release_year fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wiping \"tmdb\" collection\n",
      "Creating \"tmdb\" collection\n",
      "Status: Success\n",
      "Adding LTR QParser for tmdb collection\n",
      "Adding LTR Doc Transformer for tmdb collection\n"
     ]
    }
   ],
   "source": [
    "tmdb_collection = engine.create_collection(\"tmdb\")\n",
    "get_ltr_engine(tmdb_collection).enable_ltr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and index data\n",
    "\n",
    "Download TMDB data and index. We also download a judgment list, labeled movies as relevant/irrelevant for several movie queries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GET https://github.com/ai-powered-search/tmdb/raw/main/judgments.tgz\n",
      "GET https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz\n",
      "Successfully written 65616 documents\n"
     ]
    }
   ],
   "source": [
    "from ltr.download import download, extract_tgz \n",
    "from aips.data_loaders.movies import load_dataframe\n",
    "import tarfile\n",
    "import json\n",
    "\n",
    "dataset = [\"https://github.com/ai-powered-search/tmdb/raw/main/judgments.tgz\", \n",
    "           \"https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz\"]\n",
    "download(dataset, dest=\"data/\")\n",
    "extract_tgz(\"data/movies.tgz\", \"data/\") # -> Holds \"tmdb.json\", big json dict with corpus\n",
    "extract_tgz(\"data/judgments.tgz\", \"data/\") # -> Holds \"ai_pow_search_judgments.txt\", \n",
    "                                  # which is our labeled judgment list\n",
    "\n",
    "movies_dataframe = load_dataframe(\"data/tmdb.json\")\n",
    "tmdb_collection.write(movies_dataframe)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Up, judgments and feature logging\n",
    "\n",
    "Next up we use a _judgment list_, a set of labeled relevant / irrelevant movies for search query strings. We then extract some features from the search engine to setup a full training set we can use to train a model.\n",
    "\n",
    "Up next: [judgments and Logging](2.judgments-and-logging.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
